[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide#13945
[SPARK-16256][SQL][STREAMING] Added Structured Streaming Programming Guide#13945tdas wants to merge 4 commits intoapache:masterfrom
Conversation
|
Test build #61380 has finished for PR 13945 at commit
|
| [Dataset/DataFrame API](sql-programming-guide.html) in Scala, Java or Python to express streaming | ||
| aggregations, event-time windows, stream-to-batch joins, etc. The computation | ||
| is executed on the same optimized Spark SQL engine. Finally, the system | ||
| ensures end-to-end exactly-once fault-tolerance guarantees through |
There was a problem hiding this comment.
End-to-end exactly-once sounds like over-promising. Should probably define what the ends are, because destructive outputs can't be literally exactly-once in the face of network failures.
There was a problem hiding this comment.
ensures --> can ensure
|
|
||
| - `version` and `partition` are two parameter in the `open` that uniquely represents a set of rows that needs to be pushed out. `version` is monotonically increasing id that increases with every trigger. `partition` is an id that represents a partition of the output, since the output is distributed and will be processed on multiple executors. | ||
|
|
||
| - `open` can use the `version` and `partition` to choose whether it need to write the sequence of rows. Accordingly, it can return `true` (proceed with writing), or `false` (no need to write). If the `false` is returned, then `write` will not be called on any row. For example, after a partial failure, so partitions of the failed trigger may have already been committed to a database. Based on metadata stores in the database, the writer can identify partitions that have already been committed and |
There was a problem hiding this comment.
"whether it need" -> "whether it needs"
"If the false" -> "If false"
"so partitions" -> "some partitions"
"been committed and" ...? the end of this bullet seems to be missing
There was a problem hiding this comment.
whether it need to write => whether it needs to write
If the false is returned => If false is returned
partitions that have already been committed and => incomplete sentence?
|
Thank you very much everyone for the detailed review! I am really thankful you caught so many issues that I missed in my first pass. I have addressed your comments as well as more comments I have received offline. In the interest of Spark 2.0 release, I am going to prioritize merging this PR. If there are outstanding issues, lets solve them in follow up PRs. I am sure that we can improve this draft by a lot with everyone's contributions. |
|
Test build #61455 has finished for PR 13945 at commit
|
|
|
||
| Now consider what happens if one of the events arrives late to the application. | ||
| For example, a word that was generated at 12:04 but it was received at 12:11. | ||
| Since this windowing is based on the time in the data, the time 12:04 should considered for windowing. This occurs naturally in our window-based grouping --the late data is automatically placed in the proper windows and the correct aggregates updated as illustrated below. |
There was a problem hiding this comment.
Couple of minor corrections.
- the time 12:04, should be considered for windowing.
- grouping - the late data
|
@ScrapCodes thanks for catching those. I will update them in a follow up PR. I am merging this as is to master and 2.0 in the interest of making it to Spark 2.0 RC2 |
…Guide Title defines all. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13945 from tdas/SPARK-16256. (cherry picked from commit 64132a1) Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
|
I have opened up another PR #13978 with left over fixes. |
Title defines all.